THE ISOLATES

X. Species Accession Abbr Subspecies Host Isolate Location Region Quality Year Longitude Latitude
1 C.tyzzeri unpublished EUR_M_866 IXb Mus AA_0866 Germany BR good 2022 13.55580 52.92060
2 C.tyzzeri unpublished EUR_M_942 IXb Mus AA_0942 Germany BR good 2022 13.28390 52.95760
3 C.tyzzeri unpublished EUR_M_900 IXb Mus AA_0900 Germany BR bad 2022 14.08330 52.69670
4 C.tyzzeri SRR5683558 USA_M_GA IXb Mus UGA55 USA GA excellent 2019 -83.40056 33.98009
5 C.tyzzeri SRR16934713 CHN_M_GU IXa Mus TYGZ1 China GU excellent 2021 113.36774 23.36025
6 C.parvum SRR15694518 EUR_H_CZ IIa Human Czech5 Czech Republic CZ excellent 2017 14.99937 49.60718
7 C.parvum SRR15694490 USA_H_OR IIc Human 43801 USA OR excellent 2016 -120.70518 43.83910
8 C.parvum SRR15694483 USA_H_AL1 IIc Human M1701011662 USA AL excellent 2017 -86.72669 32.92584
9 C.parvum SRR15694501 USA_H_AL2 IId Human 45175 USA AL excellent 2018 -86.72669 32.92584
10 C.parvum SRR15694519 EUR_C_CZ IIa Cattle Czech3 Czech Republic CZ excellent 2017 14.99937 49.60718
11 C.parvum SRR15694484 CHN_C_GU1 IId Cattle 22973 China GU excellent 2019 113.36774 23.36025
12 C.parvum SRR15694495 CHN_C_GU2 IId Cattle 22972 China GU excellent 2019 113.36774 23.36025
13 C.parvum SRR15694506 CHN_C_GU3 IId Cattle 22971 China GU excellent 2019 113.36774 23.36025
14 C.parvum SRR15694521 CHN_C_SH1 IId Cattle 22207 China SH excellent 2015 121.47260 31.23931
15 C.parvum SRR15694522 CHN_C_SH2 IId Cattle 22600 China SH excellent 2016 121.47260 31.23931
16 C.parvum SRR15694502 USA_C_WI IIa Cattle NAMHS-861 USA WI excellent 2015 -89.58182 44.69938
17 C.parvum SRR15694503 USA_C_WA1 IIa Cattle NAMHS-2153 USA WA excellent 2015 -120.15838 47.36275
18 C.parvum SRR15694504 USA_C_WA2 IIa Cattle NAMHS-1925 USA WA excellent 2015 -120.15838 47.36275

Samples with Illumina Sequencing Results

Spatial Distribution of Samples

WORKFLOW

Download data from Sequence Read Archive (SRA)

prefetch SRR*

Split files

fasterq-dump *.sra --split-files

QC and trimming

--in1 and --in2: specify the input files of forward (1) and reverse (2) reads.

--out1 and --out2: specify the output files for forward and reverse reads that are still paired.

-l 50: if a read is shorter than 50 base pairs after all filters, it is removed.

-h: specifies the name of the HTML report with plots showing read quality before and after filtering.

PolyG tail trimming

This feature removes the polyG tails that arise from lack of signal in NextSeq/NovaSeq technologies. It is enabled for NextSeq/NovaSeq data by default; you can specify -g to enable it for any data, or -G to disable it.

Removal of adapter sequences

Adapter trimming is enabled by default, but you can disable it with -A. Adapter sequences can be automatically detected for both PE/SE data.

Length filter

Reads below the length threshold (e.g. due to adapter removal) are removed. Length filtering is enabled by default. The minimum length requirement is specified with -l.

Quality filtering

Quality filtering is enabled by default, but you can disable it with -Q. Currently fastp supports filtering by limiting the number of uncalled (N) bases (-n, default 5) and by the percentage of unqualified bases. To filter reads by their percentage of unqualified bases, two options should be provided:

-q : Quality threshold per base required. Default: 15, which means that a Phred quality score of at least 15 is required

-u : Percent of bases allowed to be below the quality threshold to keep the read (0~100). Default 40 means 40% bases can fail the quality threshold. If more bases fail, the read is removed

01_fastp.sh

#!/usr/bin/env bash

INDS=($(for i in /SAN/Ctyzzeri/all/fastq/*_1.fastq.gz; do echo $(basename -a -s _1.fastq.gz $i); done))

for IND in ${INDS[@]};
do

    ~/fastp.0.23.1/fastp \
    --in1 /SAN/Ctyzzeri/all/fastq/${IND}_1.fastq.gz \
    --in2 /SAN/Ctyzzeri/all/fastq/${IND}_2.fastq.gz \
    --out1 /SAN/Ctyzzeri/all/results/filteredReads/${IND}.trimmed.R1.fastq.gz \
    --out2 /SAN/Ctyzzeri/all/results/filteredReads/${IND}.trimmed.R2.fastq.gz -l 50 -h /SAN/Ctyzzeri/all/results/qc/filteredReads/${IND}.html &> /SAN/Ctyzzeri/all/results/qc/filteredReads/${IND}.log

done

Indexing the Reference Sequence

02_bwa_index.sh / 02_samtools_faidx.sh

#!/usr/bin/env bash
bwa index /SAN/Ctyzzeri/all/resources/CryptoDB-57_CtyzzeriUGA55_Genome.fasta

#!/usr/bin/env bash
samtools faidx /SAN/Ctyzzeri/all/resources/CryptoDB-57_CtyzzeriUGA55_Genome.fasta

Mapping to the Reference Genome

Using alignment software, we find where in the genome our reads originate. Once the reads are aligned, we can either call variants or construct a consensus sequence from the set of aligned reads.

-M is a standard flag that tells bwa to mark any low-quality alignments (i.e. split across long distances) as secondary - we need this for downstream compatibility.

-t tells bwa how many threads (cores) on a cluster to use - this effectively determines its speed.

Following these options, we then specify the reference genome, the forward reads and the reverse reads. Finally we write the output of the alignment to a SAM file.

03_bwa_align.sh

#!/usr/bin/env bash

REF=/SAN/Ctyzzeri/all/resources/CryptoDB-57_CtyzzeriUGA55_Genome.fasta
INDS=($(for i in /SAN/Ctyzzeri/all/results/filteredReads/*R1.fastq.gz; do echo $(basename ${i%.R*}); done))

for IND in ${INDS[@]};
do

    # declare variables
    FORWARD=/SAN/Ctyzzeri/all/results/filteredReads/${IND}.R1.fastq.gz
    REVERSE=/SAN/Ctyzzeri/all/results/filteredReads/${IND}.R2.fastq.gz
    OUTPUT=/SAN/Ctyzzeri/all/results/align/${IND}_sort.bam

    # align and sort
    echo "Aligning $IND with bwa"
    bwa mem -M -t 16 $REF $FORWARD $REVERSE | \
    samtools view -b | \
    samtools sort -T ${IND} > $OUTPUT

done

Alternatively, this can be run in parallel after specification of all individuals in a list:

for i in /SAN/Ctyzzeri/all/fastq/*1.fastq.gz; do echo $(basename -a -s _1.fastq.gz $i); done > inds

The script can be run using parallel 'sh 03_parallel_align.sh {}' :::: inds

03_parallel_align.sh

#!/usr/bin/env bash

# align a single individual
REF=/SAN/Ctyzzeri/all/resources/CryptoDB-57_CtyzzeriUGA55_Genome.fasta

# declare variables
IND=$1

FORWARD=/SAN/Ctyzzeri/all/results/filteredReads/${IND}.trimmed.R1.fastq.gz
REVERSE=/SAN/Ctyzzeri/all/results/filteredReads/${IND}.trimmed.R2.fastq.gz
OUTPUT=/SAN/Ctyzzeri/all/results/align/${IND}_sort.bam

# align and sort
echo "Aligning $IND with bwa"
bwa mem -M -t 16 $REF $FORWARD $REVERSE | \
samtools view -b | \
samtools sort -T ${IND} > $OUTPUT

Removing Duplicates

Finally, we need to remove duplicate reads from the dataset to avoid PCR duplicates and technical duplicates which inflate our sequencing depth and give us false certainty in the genotype calls.

04_markdup.sh

#!/usr/bin/env bash

# Picard does not expand globs, so loop over the sorted BAM files one at a time
for BAM in /SAN/Ctyzzeri/all/results/align/*_sort.bam
do
    IND=$(basename -s _sort.bam $BAM)
    java -jar /home/finn/picard-tools-1.119/MarkDuplicates.jar \
     REMOVE_DUPLICATES=true \
     ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT \
     MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
     INPUT=$BAM \
     OUTPUT=/SAN/Ctyzzeri/all/results/rmd/${IND}.rmd.bam \
     METRICS_FILE=/SAN/Ctyzzeri/all/results/qc/rmd/${IND}.rmd.bam.metrics
done

Indexing the BAM files

05_samtools_index.sh

#!/usr/bin/env bash

# samtools index takes a single BAM at a time, so loop over the files
for BAM in /SAN/Ctyzzeri/all/results/rmd/*.bam
do
    samtools index $BAM
done

Variant calling with bcftools

To call variants, we first use bcftools mpileup to pile up our BAM files. What does this mean exactly? We take all reads covering a given position, across all individuals, and summarise the base calls at that position. After performing the pileup, we then pass the output to bcftools call, which actually calls the variants.

For bcftools mpileup:

-a - Annotate the vcf - here we add allelic depth (AD), genotype depth (DP) and strand bias (SP).

-O - the output type. Here it is u, which means uncompressed BCF - fast to pipe into the next command.

-f - specify the reference genome to call variants against.

For bcftools call:

-f - format fields for the vcf - here they are genotype quality (GQ) and genotype probability (GP).

-v - output variant sites only - i.e. ignore non-variant sites

-m - use the bcftools multiallelic caller

-O - the output type, here z - i.e. gzipped (compressed) vcf

-o - the output path

06_mpileup.sh

#!/usr/bin/env bash

REF=/SAN/Ctyzzeri/all/resources/CryptoDB-57_CtyzzeriUGA55_Genome.fasta
OUT=/SAN/Ctyzzeri/all/results/vcf/variants.vcf.gz

bcftools mpileup -a AD,DP,SP -Ou -f $REF \
/SAN/Ctyzzeri/all/results/rmd/*.bam | \
bcftools call -f GQ,GP -m -O z -v --ploidy 1 -o $OUT

Indexing the VCF file

VCF stands for ‘variant call format’ and is a standard format used for variant calling and in population genomics. Again a detailed specification can be found online. It can take a bit of getting used to but it is widely supported and very useful.

These are the first 9 fields of the vcf and they are always present. What do they mean?

  • CHROM - chromosome or scaffold id from the reference genome
  • POS - base pair reference position
  • ID - SNP id - blank in this case
  • REF - reference base - A, C, G, T or N. More than one indicates an indel
  • ALT - alternate base - the alternate base called as a variant
  • QUAL - Phred quality score for the alternate base call (per site, not per individual)
  • FILTER - filter status - if PASS then all filters passed
  • INFO - additional info on each site - explanation stored in the header
  • FORMAT - the format for the genotype fields
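A quick way to see these fields in practice is to pull them out of a VCF with awk. This is a minimal sketch on an invented single-record VCF (toy.vcf, sample1 and all values are made up, not real data from this workflow):

```shell
# Build a minimal toy VCF illustrating the fixed columns (invented record)
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n' >> toy.vcf
printf 'Ctyz_1\t1234\t.\tA\tG\t58.2\tPASS\tDP=40\tGT:GQ\t1:99\n' >> toy.vcf

# Print CHROM, POS, REF, ALT and QUAL for every variant record
awk -F'\t' '!/^#/ { print $1, $2, $4, $5, $6 }' toy.vcf
```

On the real file the same awk line works on the output of zcat variants.vcf.gz; bcftools query offers a more robust way to do the same extraction.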

VCF format

07_vcfIndex.sh

#!/usr/bin/env bash
bcftools index /SAN/Ctyzzeri/all/results/vcf/variants.vcf.gz

Generating statistics on the VCF with vcftools (local)

In order to generate statistics from our VCF and also actually later apply filters, we are going to use vcftools, a very useful and fast program for handling vcf files.

Determining how to set filters on a dataset is a bit of a nightmare - it is something newcomers (and actually experienced people too) really struggle with. There also isn’t always a clear answer - if you can justify your actions then there are often multiple solutions to how you set your filters. What is important is that you get to know your data and from that determine some sensible thresholds.

Luckily, vcftools makes it possible to easily calculate these statistics. In this section, we will analyse our VCF in order to get a sensible idea of how to set such filtering thresholds. The main areas we will consider are:

Depth:

You should always include a minimum depth filter and ideally also a maximum depth one too. Minimum depth cutoffs will remove false positive calls and will ensure higher quality calls too. A maximum cut off is important because regions with very, very high read depths are likely repetitive ones mapping to multiple parts of the genome.

Quality:

Genotype quality is also an important filter - essentially you should not trust any genotype with a Phred score below 20, which corresponds to less than 99% accuracy.

Minor allele frequency:

MAF can cause big problems with SNP calls - and also inflate statistical estimates downstream. Ideally you want an idea of the distribution of your allelic frequencies but 0.05 to 0.10 is a reasonable cut-off. You should keep in mind however that some analyses, particularly demographic inference can be biased by MAF thresholds.

Missing data:

How much missing data are you willing to tolerate? It will depend on the study but typically any site with >25% missing data should be dropped.

For bigger files, it makes sense to generate a subset with vcfrandomsample, but these files are small enough to analyse in full.

The entire script to generate statistics can be looped as follows:

08_vcfStats.sh

#!/usr/bin/env bash

INDS=($(for i in ~/Documents/Github/Crypto/Genome_Analysis/products/vcf/*.vcf.gz; do echo $(basename -a -s .vcf.gz $i); done))

for IND in ${INDS[@]};
do

    #### Calculate allele frequency
    vcftools --gzvcf ~/Documents/Github/Crypto/Genome_Analysis/products/vcf/${IND}.vcf.gz --freq2 --out ~/Documents/Github/Crypto/Genome_Analysis/products/vcftools/${IND}

    #### Calculate mean depth of coverage per individual
    vcftools --gzvcf ~/Documents/Github/Crypto/Genome_Analysis/products/vcf/${IND}.vcf.gz --depth --out ~/Documents/Github/Crypto/Genome_Analysis/products/vcftools/${IND}

    #### Calculate mean depth of coverage for each site
    vcftools --gzvcf ~/Documents/Github/Crypto/Genome_Analysis/products/vcf/${IND}.vcf.gz --site-mean-depth --out ~/Documents/Github/Crypto/Genome_Analysis/products/vcftools/${IND}

    #### Calculate the site quality score for each site
    vcftools --gzvcf ~/Documents/Github/Crypto/Genome_Analysis/products/vcf/${IND}.vcf.gz --site-quality --out ~/Documents/Github/Crypto/Genome_Analysis/products/vcftools/${IND}

    #### Calculate the proportion of missing data per individual
    vcftools --gzvcf ~/Documents/Github/Crypto/Genome_Analysis/products/vcf/${IND}.vcf.gz --missing-indv --out ~/Documents/Github/Crypto/Genome_Analysis/products/vcftools/${IND}

    #### Calculate the proportion of missing data per site
    vcftools --gzvcf ~/Documents/Github/Crypto/Genome_Analysis/products/vcf/${IND}.vcf.gz --missing-site --out ~/Documents/Github/Crypto/Genome_Analysis/products/vcftools/${IND}

    #### Calculate heterozygosity and inbreeding coefficient per individual
    vcftools --gzvcf ~/Documents/Github/Crypto/Genome_Analysis/products/vcf/${IND}.vcf.gz --het --out ~/Documents/Github/Crypto/Genome_Analysis/products/vcftools/${IND}

done

VARIANT-BASED STATISTICS

QUALITY:

Genotype quality is also an important filter - essentially you should not trust any genotype with a Phred score below 20, which corresponds to less than 99% accuracy. We will set the quality filter to 30, as a Phred score of 30 represents a 1 in 1000 chance that a SNP call is erroneous.

  • QUAL=30
MINOR ALLELE FREQUENCY:

We will take a look at the distribution of allele frequencies. This will help inform our minor-allele frequency (MAF) thresholds.

The upper bound of the distribution is 0.5, which makes sense because if the MAF were any higher, it would not be the minor allele frequency! How do we interpret MAF? It is an important measure because low-MAF alleles may only occur in one or two individuals. It is possible that some of these low-frequency alleles are in fact unreliable base calls - i.e. a source of error.

Setting MAF cutoffs is actually not that easy or straightforward. Hard MAF filtering (i.e. setting a high threshold) can severely bias estimation of the site frequency spectrum and cause problems with demographic analyses. Similarly, an excess of low frequency, ‘singleton’ SNPs (i.e. only occurring in one individual) can mean you keep many uninformative loci in your dataset that make it hard to model things like population structure.

Usually then, it is best practice to produce one dataset with a good MAF threshold and keep another without any MAF filtering at all.

summary(var_freq_05$maf) # Mean = 0.289     --> ?
summary(var_freq_07$maf) # Mean = 0.212     --> ?
summary(var_freq_18$maf) # Mean = 0.235     --> ?
  • (MAF=0.1) or (MAF=0.05) <- I’m not sure if/how to set this filter.
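The MAF values summarised above can be recomputed from the vcftools --freq2 output: per site, the MAF is simply the smaller of the allele frequencies. A sketch on an invented .frq excerpt (the header names FREQ1/FREQ2 stand in for the unlabeled {FREQ} columns of the real output):

```shell
# Invented excerpt of a vcftools --freq2 output; the last two columns
# are the allele frequencies ({FREQ} columns in the real output)
cat > example.frq <<'EOF'
CHROM POS N_ALLELES N_CHR FREQ1 FREQ2
Ctyz_1 100 2 30 0.9 0.1
Ctyz_1 200 2 30 0.4 0.6
EOF

# MAF = the smaller of the two frequencies at each site
awk 'NR > 1 { maf = ($5 < $6 ? $5 : $6); print $1, $2, maf }' example.frq
```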
DEPTH:

Variant Mean Depth

Next we will examine the mean depth for each of our variants. This is essentially the number of reads that have mapped to this position. The output we generated with vcftools is the mean of the read depth across all individuals - it is for both alleles at a position and is not partitioned between the reference and the alternative.

Variant Depth with xlim(0,150) and ylim(0, 0.025) gives a better idea of the distribution.

We could set our minimum coverage at the 5 and 95% quantiles but we should keep in mind that the more reads that cover a site, the higher confidence our basecall is. 10x is a good rule of thumb as a minimum cutoff for read depth, although if we wanted to be conservative, we could go with 15x.

What is more important here is that we set a good maximum depth cutoff. As the outliers show, some regions clearly have extremely high coverage, which likely reflects mapping/assembly errors as well as paralogous or repetitive regions. We want to exclude these as they will bias our analyses. Usually a good rule of thumb is around twice the mean depth.

summary(var_depth_05$mean_depth) # Mean = 42.82     x 2 =  85.64
summary(var_depth_07$mean_depth) # Mean = 67.5032   x 2 = 135.01
summary(var_depth_18$mean_depth) # Mean = 80.51647  x 2 = 161.03
  • MIN_DEPTH = 10
  • MAX_DEPTH = 85.64 / 135.01 / 161.03
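The mean depth and the 2x maximum-depth cutoff can be derived directly from the --site-mean-depth output (.ldepth.mean). A sketch on an invented excerpt; the same awk line run on the real files reproduces the per-dataset means:

```shell
# Invented excerpt of a vcftools --site-mean-depth output (.ldepth.mean);
# column 3 (MEAN_DEPTH) is the per-site mean depth across individuals
cat > example.ldepth.mean <<'EOF'
CHROM POS MEAN_DEPTH VAR_DEPTH
Ctyz_1 101 40.0 3.1
Ctyz_1 202 45.0 2.7
Ctyz_1 303 44.0 4.2
EOF

# Mean depth across sites and the 2x maximum-depth cutoff
awk 'NR > 1 { sum += $3; n++ } END { printf "mean=%.2f max=%.2f\n", sum/n, 2*sum/n }' example.ldepth.mean
```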
MISSINGNESS:

Variant Missingness

Next up we will look at the proportion of missingness at each variant. This is a measure of how many individuals lack a genotype at a given site.

The results indicate relatively high missingness, especially among the C.tyzzeri samples. One thing to note here is that the vcftools --max-missing option is counterintuitive: it specifies the proportion of data that must be present, so a threshold of 0.9 tolerates only 10% missingness. Different datasets will likely have different missingness profiles; typically thresholds of 0.75-0.95 are used. We will accept at most 15% missingness because of the C.tyzzeri dataset (=0.85).

summary(var_miss_05$fmiss) # Mean = 0.12570     --> MISS=0.85
summary(var_miss_07$fmiss) # Mean = 0.09642     --> MISS=0.90
summary(var_miss_18$fmiss) # Mean = 0.06108     --> MISS=0.90
  • MISS = 0.85 / 0.9 / 0.9

Individual based statistics

As well as the per-variant statistics we generated earlier, we also calculated some individual metrics. We can look at the distribution of these to get an idea of whether some of our individuals have not sequenced or mapped as well as others. This is good practice with a new dataset. Many of these statistics can be compared to other measures generated from the data (e.g. principal components as a measure of population structure) to see if they drive any apparent patterns in the data.

MEAN-DEPTH PER IND:

Because we are only plotting data for a few individuals, the plot looks a little disjointed. While there is some evidence that some individuals were sequenced to a higher depth than others, there are no extreme outliers. So this doesn't suggest any issue with individual sequencing depth.

MISSING DATA PER IND:

Next we will look at the proportion of missing data per individual. This is very similar to the missing data per site, but here we will focus on the fmiss column - i.e. the proportion of missing data.

This shows that the proportion of missing data per individual ranges from 0.01 to 0.6. Most of our individuals are sequenced well, while the extreme value at 0.6 indicates rather bad sequencing that probably arises from one of the C.tyzzeri samples (AA_0900 I would assume, due to reportedly bad sequencing). We will therefore set a threshold at 0.1.
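Individuals exceeding the 0.1 threshold can be listed straight from the --missing-indv output (.imiss), whose fifth column is F_MISS. A sketch with invented values:

```shell
# Invented excerpt of a vcftools --missing-indv output (.imiss)
cat > example.imiss <<'EOF'
INDV N_DATA N_GENOTYPES_FILTERED N_MISS F_MISS
EUR_M_866 1000 0 20 0.020
EUR_M_900 1000 0 600 0.600
EUR_M_942 1000 0 15 0.015
EOF

# Flag individuals whose missingness exceeds the 0.1 threshold
awk 'NR > 1 && $5 > 0.1 { print $1, $5 }' example.imiss
```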

Heterozygosity and inbreeding coefficient per individual

We cannot assess heterozygosity due to the haploid nature of Cryptosporidium spp.

Technically, computing heterozygosity and the inbreeding coefficient (F) for each individual can quickly highlight outlier individuals that are e.g. inbred (strongly positive F), suffer from high sequencing error rates or contamination with DNA from another individual leading to inflated heterozygosity (strongly negative F), or show PCR duplicates or low read depth leading to allelic dropout and thus underestimated heterozygosity (strongly positive F). However, note that here we assume Hardy-Weinberg equilibrium. If the individuals are not sampled from the same population, the expected heterozygosity will be overestimated due to the Wahlund effect. It may still be worth computing heterozygosities even if the samples are from more than one population, to check whether any individual stands out, which could indicate problems.

ALL STATISTICS SUMMARY

FILTERING BASED ON STATISTICS

Now that we have an idea of how to set our thresholds, we will do just that and filter for certain attributes:

  • --gzvcf - input path - denotes a gzipped vcf file

  • --remove-indels - remove all indels (SNPs only)

  • --maf - set minor allele frequency - 0.1 here

  • --max-missing - set minimum non-missing data. A little counterintuitive - 0 allows sites that are completely missing, 1 allows no missing data at all. Here 0.75 means we will tolerate 25% missing data.

  • --minQ - the minimum quality score required for a site to pass our filtering threshold. Here we set it to 30.

  • --min-meanDP - the minimum mean depth for a site.

  • --max-meanDP - the maximum mean depth for a site.

  • --minDP - the minimum depth allowed for a genotype - any individual failing this threshold is marked as having a missing genotype.

  • --maxDP - the maximum depth allowed for a genotype - any individual failing this threshold is marked as having a missing genotype.

  • --recode - recode the output - necessary to output a vcf

  • --stdout - pipe the vcf to stdout (easier for file handling)

(vcftools is NOT available on Harriet..)

09_filterVCF.sh

#!/usr/bin/env bash

VCF_IN=~/Desktop/vcf/variants.vcf.gz
VCF_OUT=~/Desktop/vcfFiltered/variants_filtered.vcf.gz

# set filters
MAF=X
MISS=X
QUAL=X
MIN_DEPTH=X
MAX_DEPTH=X

# perform the filtering with vcftools
vcftools --gzvcf $VCF_IN \
--remove-indels --maf $MAF --max-missing $MISS --minQ $QUAL \
--min-meanDP $MIN_DEPTH --max-meanDP $MAX_DEPTH \
--minDP $MIN_DEPTH --maxDP $MAX_DEPTH --recode --stdout | gzip -c > \
$VCF_OUT
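After filtering, it is worth checking how many sites survived: the number of variant records in a gzipped VCF is simply the number of non-header lines. Illustrated here on an invented two-record toy file; the same one-liner applies to the real $VCF_OUT:

```shell
# Build a toy gzipped VCF with two variant records (invented content)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\nCtyz_1\t100\nCtyz_1\t200\n' | gzip -c > toy_filtered.vcf.gz

# Count variant records: every line that does not start with '#'
gzip -dc toy_filtered.vcf.gz | grep -vc '^#'
```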

PRINCIPAL COMPONENT ANALYSIS (PCA)

Essentially, PCA aims to identify the main axes of variation in a dataset with each axis being independent of the next (i.e. there should be no correlation between them). The first component summarizes the major axis variation and the second the next largest and so on, until cumulatively all the available variation is explained. In the context of genetic data, PCA summarizes the major axes of variation in allele frequencies and then produces the coordinates of individuals along these axes.

LINKAGE PRUNING

Linkage Pruning

One of the major assumptions of PCA is that the data we use is independent - i.e. there are no spurious correlations among the measured variables. This is obviously not the case for most genomic data as allele frequencies are correlated due to physical linkage and linkage disequilibrium. So as a first step, we need to prune our data set of variants that are in linkage.

  • --vcf - specifies the location of our VCF file.

  • --double-id - tells plink to duplicate the id of our samples (plink typically expects a family and individual id - i.e. for pedigree data - which is not necessary for us).

  • --allow-extra-chr - allow additional chromosomes beyond the human chromosome set. This is necessary as otherwise plink expects chromosomes 1-22 and the human X chromosome.

  • --set-missing-var-ids - also necessary to set a variant ID for our SNPs. Human and model organisms often have annotated SNP names and so plink will look for these. We do not have them so instead we set ours to default to chromosome:position, which can be achieved in plink with the option @:#.

  • --indep-pairwise - the command that actually performs the linkage pruning. The first argument, 50, denotes a window of 50 Kb. The second argument, 10, is our window step size - meaning we move 10 bp each time we calculate linkage. Finally, we set an r2 threshold - i.e. the level of linkage we are willing to tolerate. Here we prune any variants that show an r2 greater than 0.1.

  • --out - the prefix for the output data.

    10_linkagePruning.sh

    #!/usr/bin/env bash

    VCF=/SAN/Ctyzzeri/all/results/vcfFiltered/*_samples_filtered.vcf.gz

    ~/plink-1.9/plink \
    --vcf $VCF --double-id --allow-extra-chr \
    --set-missing-var-ids @:#\$1,\$2 \
    --indep-pairwise 50 10 0.1 --out *_samples

PCA

We can rerun plink with a few additional arguments to get it to conduct a PCA.

  • –extract - this just lets plink know we want to extract only these positions from our VCF - in other words, the analysis will only be conducted on these.

  • –make-bed - this is necessary to write out some additional files for another type of population structure analysis - a model based approach with admixture.

  • –pca - fairly self explanatory, this tells plink to calculate a principal components analysis.

PCA output:

  • 15_samples.eigenval - the eigenvalues from our analysis

  • 15_samples.eigenvec - the eigenvectors from our analysis

Plink binary output:

  • 15_samples.bed - a binary file necessary for admixture analysis. It is essentially the genotypes of the pruned dataset recoded as 1s and 0s.

  • 15_samples.bim - a map file (i.e. information file) of the variants contained in the bed file.

  • 15_samples.fam - a map file for the individuals contained in the bed file.

    11_pca.sh

    #!/usr/bin/env bash

    INDS=($(for i in /SAN/Ctyzzeri/all/results/vcfFiltered/*_filtered.vcf.gz; do echo $(basename -a -s _filtered.vcf.gz $i); done))

    for IND in ${INDS[@]}; do

      ~/plink-1.9/plink \
      --vcf /SAN/Ctyzzeri/all/results/vcfFiltered/${IND}_filtered.vcf.gz --double-id --allow-extra-chr \
      --set-missing-var-ids @:#\$1,\$2 \
      --extract /SAN/Ctyzzeri/all/results/pcaFiltered/${IND}.prune.in \
      --make-bed --pca --out /SAN/Ctyzzeri/all/results/pcaFiltered/${IND}

    done

First we will plot the eigenvalues. It is quite straightforward to translate these into percentage variance explained (although note, you could just plot them raw if you wished). Let's take a look at the bar plot showing the percentage of variance each principal component explains.

C.TYZZERI SSP. (n=5)

set filters:

  • QUAL = 30
  • MAF = 0.1
  • MISS = 0.85
  • MIN_DEPTH = 10
  • MAX_DEPTH = 2*42.82
##   PC          pve
## 1  1 5.523559e+01
## 2  2 2.740809e+01
## 3  3 1.153670e+01
## 4  4 5.819619e+00
## 5  5 1.110220e-14
## [1]  55.23559  82.64368  94.18038 100.00000 100.00000

TYZ-PAR-HOM PCA (n = 7) filtered based on statistics (STATS) - new -

set filters:

  • QUAL = 30
  • MAF = 0.1
  • MISS = 0.9
  • MIN_DEPTH = 10
  • MAX_DEPTH = 2*67.5032
##   PC       pve
## 1  1 44.614246
## 2  2 23.711121
## 3  3 15.957667
## 4  4  9.845508
## 5  5  6.373428
## 6  6  2.706563
## 7  7 -3.208533
## [1]  44.61425  68.32537  84.28303  94.12854 100.50197 103.20853 100.00000

TYZ-PAR-PCA (n = 7) filtered based on statistics (STATS) - old -

set filters: MAF=0.05 MISS=0.90 QUAL=30 MIN_DEPTH=10 MAX_DEPTH=50

##   PC       pve
## 1  1 36.023347
## 2  2 23.768769
## 3  3 17.840759
## 4  4 14.314135
## 5  5  7.545572
## 6  6  3.121628
## 7  7 -2.614211
## [1]  36.02335  59.79212  77.63288  91.94701  99.49258 102.61421 100.00000

TYZZERI VS. PARVUM (n = 18)

set filters:

  • QUAL = 30
  • MAF = 0.1
  • MISS = 0.9
  • MIN_DEPTH = 10
  • MAX_DEPTH = 2*80.51647
##    PC         pve
## 1   1 47.25446176
## 2   2 19.44401083
## 3   3 15.69646808
## 4   4  6.92581110
## 5   5  4.19130438
## 6   6  3.14121741
## 7   7  2.71866689
## 8   8  1.17385133
## 9   9  0.86905809
## 10 10  0.66560409
## 11 11  0.38748094
## 12 12  0.21926286
## 13 13  0.20130768
## 14 14  0.06817536
## 15 15 -0.01918241
## 16 16 -0.04349182
## 17 17 -0.31704581
## 18 18 -2.57696078
##  [1]  47.25446  66.69847  82.39494  89.32075  93.51206  96.65327  99.37194
##  [8] 100.54579 101.41485 102.08045 102.46793 102.68720 102.88851 102.95668
## [15] 102.93750 102.89401 102.57696 100.00000

Looking at PC2 and PC3 in multi-species PCAs (n = 7) and (n = 18)

TYZ-PAR-HOM PCA (n = 7) - new -

set filters:

  • QUAL = 30
  • MAF = 0.1
  • MISS = 0.9
  • MIN_DEPTH = 10
  • MAX_DEPTH = 2*67.5032

TYZ-PAR-PCA (n = 7) filtered based on statistics (STATS) - old -

set filters: MAF=0.05 MISS=0.90 QUAL=30 MIN_DEPTH=10 MAX_DEPTH=50

TYZZERI VS. PARVUM (n = 18)

set filters:

  • QUAL = 30
  • MAF = 0.1
  • MISS = 0.9
  • MIN_DEPTH = 10
  • MAX_DEPTH = 2*80.51647

Genes in the filtered, pruned VCF -> divergent, unlinked genes that impact the PCA

GeneID TYZ TP TPH incl_in Chr Product
CTYZ_00000303 TRUE TRUE TRUE 3 Ctyz_8 Signal peptide LNR repeat containing protein
CTYZ_00000497 TRUE TRUE TRUE 3 Ctyz_8 Uncharacterized protein
CTYZ_00000535 TRUE TRUE TRUE 3 Ctyz_8 Alpha-trehalose-phosphate synthase
CTYZ_00001012 TRUE TRUE TRUE 3 Ctyz_7 Cryptosporidial mucin
CTYZ_00001204 TRUE TRUE TRUE 3 Ctyz_6 Uncharacterized protein
CTYZ_00002289 TRUE TRUE TRUE 3 Ctyz_2 Homologous-pairing protein 2
CTYZ_00002308 TRUE TRUE TRUE 3 Ctyz_2 SPX/RING-type Zinc finger domain containing protein
CTYZ_00003509 TRUE TRUE TRUE 3 Ctyz_5 Calponin homology domain containing protein

Detecting Structure with ADMIXTURE

ADMIXTURE is a clustering software similar to STRUCTURE, with the aim of inferring populations and individual ancestries.

Prepare the cluster analysis from the filtered vcf files.

13_admixture_prep.sh

#!/usr/bin/env bash

# FILE=05_samples_filtered_stats2
# FILE=07_samples_filtered_stats2
# FILE=18_samples_filtered_stats2

# Generate the input file in plink format
plink --vcf ~/Documents/Github/Crypto/Genome_Analysis/products/vcfFiltered/$FILE.vcf.gz \
--make-bed --out $FILE --allow-extra-chr --const-fid 0

# ADMIXTURE does not accept chromosome names that are not human chromosomes. We will thus just exchange the first column by 0
awk '{$1="0";print $0}' $FILE.bim > $FILE.bim.tmp
mv $FILE.bim.tmp $FILE.bim

Execute the individual lines below, or loop over K=2 to K=5, generating a log file for each run.

14_admixture_A.sh

#!/usr/bin/env bash
FILE=~/Documents/Github/Crypto/Genome_Analysis/products/admixture/05_samples_filtered_stats2

admixture --cv $FILE.bed 2 > log2.out
admixture --cv $FILE.bed 3 > log3.out
admixture --cv $FILE.bed 4 > log4.out
admixture --cv $FILE.bed 5 > log5.out

FILE=~/Documents/Github/Crypto/Genome_Analysis/products/admixture/07_samples_filtered_stats2

admixture --cv $FILE.bed 2 > log2.out
admixture --cv $FILE.bed 3 > log3.out
admixture --cv $FILE.bed 4 > log4.out
admixture --cv $FILE.bed 5 > log5.out

FILE=~/Documents/Github/Crypto/Genome_Analysis/products/admixture/18_samples_filtered_stats2

admixture --cv $FILE.bed 2 > log2.out
admixture --cv $FILE.bed 3 > log3.out
admixture --cv $FILE.bed 4 > log4.out
admixture --cv $FILE.bed 5 > log5.out

----------------------------------------------------------------------------

for i in {2..5}
do
  admixture --cv $FILE.bed $i > log${i}.out
done

ADMIXTURE produces two files: .Q, which contains the cluster assignments for each individual, and .P, which contains the population allele frequencies for each SNP.

To identify the best value of K (the number of clusters), i.e. the value with the lowest cross-validation error, we need to collect the CV errors. Below are three different ways to extract K and the corresponding CV error.

grep "CV" *out | awk '{print $3,$4}' | sed -e 's/(//;s/)//;s/://;s/K=//'  > $FILE.cv.error

OR

grep "CV" *out | awk '{print $3,$4}' | cut -c 4,7-20 > $FILE.cv.error

OR

awk '/CV/ {print $3,$4}' *out | cut -c 4,7-20 > $FILE.cv.error
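Once the cv.error file exists, the best K is simply the row with the smallest error, which a numeric sort picks out directly. A sketch on an invented cv.error file in the same two-column "K error" format:

```shell
# Invented cv.error file (K in column 1, CV error in column 2)
cat > example.cv.error <<'EOF'
2 0.38735
3 0.28832
4 0.22959
5 0.30370
EOF

# Sort by CV error (general numeric) and keep the row with the best K
sort -k2,2g example.cv.error | head -n 1
```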

The CV errors of all three datasets indicate that K=4 is the best number of clusters.

cv_error_05_samples cv_error_07_samples cv_error_18_samples
2 0.38735 2 0.38735 2 0.38735
3 0.28832 3 0.28832 3 0.28832
4 0.22959 4 0.22959 4 0.22959
5 0.30370 5 0.30370 5 0.30370

To make plotting easier, we can make a file with the individual names in one column and the species names in the second column. It should look something like this:

Ind Species
866 TYZ1
900 TYZ2
942 TYZ3
NZL_H HOM
TYGZ1 TYZ4
UGA55 TYZ5
USA_P PAR

The following R script was written by Joana Meier in 2019 and helps generate the plots:

wget https://github.com/speciationgenomics/scripts/raw/master/plotADMIXTURE.r
chmod +x plotADMIXTURE.r

It requires four arguments:

  1. the prefix for the ADMIXTURE output files (-p )
  2. the file with the species information (-i )
  3. the maximum number of K to be plotted (-k 5)
  4. a list with the populations or species separated by commas (-l <pop1,pop2…>).

The list of populations provided with -l gives the order in which the populations or species are plotted.

The script by Joana Meier can be run like this:

#!/usr/bin/env bash

FILE=05_samples_filtered_stats2

Rscript plotADMIXTURE.r \
-p ~/Documents/Github/Crypto/Genome_Analysis/products/admixture/$FILE \
-i ~/Documents/Github/Crypto/Genome_Analysis/products/admixture/$FILE.list \
-k 5 -l TYZ1,TYZ2,TYZ3,TYZ4,TYZ5


FILE=07_samples_filtered_stats2

Rscript plotADMIXTURE.r \
-p ~/Documents/Github/Crypto/Genome_Analysis/products/admixture/$FILE \
-i ~/Documents/Github/Crypto/Genome_Analysis/products/admixture/$FILE.list \
-k 5 -l TYZ1,TYZ2,TYZ3,HOM,TYZ4,TYZ5,PAR


FILE=18_samples_filtered_stats2

Rscript plotADMIXTURE.r \
-p ~/Documents/Github/Crypto/Genome_Analysis/products/admixture/$FILE \
-i ~/Documents/Github/Crypto/Genome_Analysis/products/admixture/$FILE.list \
-k 5 -l CHN_C_GU1,CHN_C_GU2,CHN_C_GU3,CHN_C_SH1,CHN_C_SH2,CHN_M_GU,EUR_C_CZ,EUR_H_CZ,EUR_M_866,EUR_M_900,EUR_M_942,USA_C_WA1,USA_C_WA2,USA_C_WI,USA_H_AL1,USA_H_AL2,USA_H_OR,USA_M_GA

Clustering of TYZ internal (05_samples)

Clustering of TYZ-PAR-HOM (07_samples)

Clustering TYZ-PAR (18_samples)